1 Introduction and BBC iPlayer streaming data

2 Assignment

3 Learning Objectives

3.1 How to get the most out of this exercise

4 Cleaned Data

I have already processed and cleaned the original viewing data. In this step you will first generate a user-based dataset, which we will use to train clustering algorithms to identify meaningful clusters in the data.

Let’s load the cleaned data and investigate what’s in the data. See below for column descriptions.

library(tidyverse) #loads readr (read_csv), dplyr, tidyr and ggplot2, all used below
cleaned_BBC_Data <- read_csv(file="Results_Step1.csv",col_names = TRUE)
## Rows: 313256 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): user_id, program_id, series_id, genre, streaming_id, weekend, time...
## dbl  (9): prog_duration_min, time_viewed_min, duration_more_30s, time_viewed...
## dttm (1): start_date_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(dplyr)
glimpse(cleaned_BBC_Data) 
## Rows: 313,256
## Columns: 17
## $ user_id                   <chr> "cd2006", "cd2006", "cd2006", "cd2006", "cd2…
## $ program_id                <chr> "b8fbf2", "e2f113", "0e0916", "ca03b9", "cfe…
## $ series_id                 <chr> "e0480e", "933a1b", "b68e79", "5d0813", "eba…
## $ genre                     <chr> "Comedy", "Factual", "Entertainment", "Sport…
## $ start_date_time           <dttm> 2017-01-19 22:17:04, 2017-02-14 19:12:36, 2…
## $ streaming_id              <chr> "1484864257965_1", "1487099603980_1", "14847…
## $ prog_duration_min         <dbl> 1.850000, 0.500000, 1.366667, 1.616667, 8.50…
## $ time_viewed_min           <dbl> 1.85000000, 0.49908333, 1.36666667, 1.616666…
## $ duration_more_30s         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ time_viewed_more_5s       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ percentage_program_viewed <dbl> 1.000000000, 0.998166667, 1.000000000, 1.000…
## $ watched_more_60_percent   <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,…
## $ month                     <dbl> 1, 2, 1, 2, 3, 2, 4, 3, 4, 3, 3, 4, 3, 3, 2,…
## $ day                       <dbl> 5, 3, 4, 1, 1, 3, 1, 6, 7, 7, 1, 5, 7, 3, 2,…
## $ hour                      <dbl> 22, 19, 21, 14, 20, 19, 20, 21, 21, 20, 18, …
## $ weekend                   <chr> "weekday", "weekday", "weekday", "weekend", …
## $ time_of_day               <chr> "Evening", "Evening", "Evening", "Day", "Eve…

Before we proceed let’s consider the usage in January only.

cleaned_BBC_Data<-filter(cleaned_BBC_Data,month==1)

5 User based data

We will try to create meaningful customer segments that describe users of the BBC iPlayer service. First we need to convert the data to a user-based format and generate a summary of each user's usage.

5.1 Data format

The data is presented to us in an event-based format (every row captures a viewing event). However, to cluster users we need to capture the differences in their general watching habits.

How can you convert the current dataset to a customer-based dataset (i.e., one that summarizes the general watching habits of each user)? In what dimensions could BBC iPlayer users be differentiated? Can you come up with variables that capture these? Discuss these issues with your group and determine a strategy for how the data must be processed.

5.2 Feature Engineering

For the workshop let’s generate the following variables for each user.

  1. Total number of shows watched and total time spent watching shows on iPlayer by each user in the data
userData<-cleaned_BBC_Data %>% group_by(user_id) %>% summarise(noShows=n(), total_Time=sum(time_viewed_min)) 
  1. Proportion of shows watched during the weekend for each user.
#Let's find the number of shows on weekend and weekdays
userData2<-cleaned_BBC_Data %>% group_by(user_id,weekend) %>% summarise(noShows=n())
## `summarise()` has grouped output by 'user_id'. You can override using the `.groups` argument.
#Let's find percentage in weekend and weekday
userData3 = userData2%>% group_by(user_id) %>% mutate(weight_pct = noShows / sum(noShows))

#Let's create a data frame with each user in a row.
userData3<-select (userData3,-noShows)
userData3<-userData3%>% spread(weekend,weight_pct,fill=0) %>%as.data.frame()
#Let's merge the final result with the data frame from the previous step.
userdatall<-left_join(userData,userData3,by="user_id")
  1. Proportion of shows watched during different times of day for each user.
#Code in this block follows the same steps above.
userData2<-cleaned_BBC_Data %>% group_by(user_id,time_of_day) %>% summarise(noShows=n()) %>% mutate(weight_pct = noShows / sum(noShows))
## `summarise()` has grouped output by 'user_id'. You can override using the `.groups` argument.
userData4<-select (userData2,-c(noShows))
userData4<-spread(userData4,time_of_day,weight_pct,fill=0)

userdatall<-left_join(userdatall,userData4,by="user_id")

Question 1. Find the proportion of shows watched in each genre by each user. Your code below.

#Your code here

#add your results to the data frame userdatall

#Let's find the number of shows watched by each genre
userData4<-cleaned_BBC_Data %>% group_by(user_id,genre) %>% summarise(noShows=n())
## `summarise()` has grouped output by 'user_id'. You can override using the `.groups` argument.
#Let's find percentage in each genre
userData5 = userData4%>% group_by(user_id) %>% mutate(weight_pct = noShows / sum(noShows))

#Let's create a data frame with each user in a row.
userData5<-select (userData5,-noShows)
userData5<-userData5 %>% spread(genre,weight_pct,fill=0) %>% as.data.frame()
#Let's merge the final result with the data frame from the previous step.
userdatall<-left_join(userdatall,userData5,by="user_id")

Question 2. Add one more variable of your own. Describe why this might be useful for differentiating viewers in 1 or 2 lines. Your code below.

#Your code here

#add your results to the data frame userdatall
#Let's add average percentage of program viewed for each user
userData6 <- cleaned_BBC_Data %>% group_by(user_id) %>% summarise(avg_view_pct=mean(percentage_program_viewed))

userdatall<-left_join(userdatall,userData6,by="user_id")

6 Visualizing user-based data

Next visualize the information captured in the user based data. Let’s start with the correlations.

library("GGally")
userdatall %>% 
  select(-user_id) %>% #keep Y variable last
  ggcorr(method = c("pairwise", "pearson"), layout.exp = 3,label_round=2, label = TRUE,label_size = 2,hjust = 1)

Question 3. Which variables are most correlated? What’s the implication of this for clustering?

The variables that are most correlated are those that we would expect. Firstly, there is a perfect negative correlation between ‘weekend’ and ‘weekday’, as these variables represent the percentage of the user’s watch time that comes on the weekend and during the week respectively. The higher the percentage of a user’s watchtime that occurs on the weekend, the lower the watchtime percentage will be during the week (and vice versa) - therefore, these variables have to have a perfectly negative correlation.

Additionally, there is a very strong positive correlation between ‘noShows’ and ‘total_Time’ - this is because, in general, the more different shows that a user watches, the longer we expect their total watch time to be. This however is not a perfect correlation, as there will be people who switch between shows very quickly, trying to find one they really like.

There are also fairly strong negative correlations between the different parameters for the time of day (‘Evening’, ‘Day’ & ‘Afternoon’) - this is also expected, as the larger the proportion of watch time that occurs in the evening, the smaller the proportion that can occur in the day and afternoon. We can also see some correlations between different show ‘types’ - for example there are notable negative correlations between genres like ‘Sport’ and ‘Drama’, as well as ‘Factual’ and ‘Drama’. This makes sense, as people who watch a lot of drama are probably less likely to watch sport and documentaries. Most of the other variables exhibit fairly weak correlations.

The correlations between ‘weekend’ and ‘weekday’, and between ‘total_Time’ and ‘noShows’, should be noted when we do our cluster analysis. Because the variables in each pair are close to perfectly correlated, they essentially represent the same concept, and including both would give that concept more weight than it should have. For example, ‘weekend’ and ‘weekday’ both capture when during the week the user watches iPlayer; one of the two variables is enough to incorporate this concept into the cluster analysis, so we should not include both.
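The perfect negative correlation between the weekend and weekday proportions is a mechanical consequence of the two summing to one for every user; a tiny standalone sketch (toy numbers, not the real data) makes this concrete:

```r
# Why 'weekend' and 'weekday' must correlate at exactly -1: for each user the
# two proportions sum to one, so one is a perfect linear function of the other.
# Toy illustration with made-up proportions:
weekend <- c(0.10, 0.40, 0.00, 0.75)
weekday <- 1 - weekend
cor(weekend, weekday)  # exactly -1
```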

Question 4. Investigate the distribution of noShows and total_Time using box-whisker plots and histograms. Explain what you observe in 1-2 sentences. Are you worried about outliers?

#Insert your code here:

ggplot(data = userdatall, aes(x = "", y = noShows)) + 
  geom_boxplot() +
  labs(title = "Boxplot for Number of Shows watched", x = "Number of Shows", y = "Value")+
  scale_color_manual(values=c("#56B4E9"))+
  NULL

ggplot(userdatall, aes(x = noShows))+
  geom_histogram(bins = 100, color = "black", fill = "#56B4E9") +
  labs(title = "Histogram for Number of Shows watched", x = "Number of Shows", y = "Count")+
  NULL

ggplot(data = userdatall, aes(x = "", y = total_Time)) + 
  geom_boxplot() +
  labs(title = "Boxplot for Total Time (iPlayer) watched", x = "Total Time (mins) ", y = "Value")+
  scale_color_manual(values=c("#56B4E9"))+
  NULL

ggplot(userdatall, aes(x = total_Time))+
  geom_histogram(bins = 100, color = "black", fill = "#56B4E9") +
  labs(title = "Histogram for Total Time (iPlayer) watched", x = "Total Time (mins)", y = "Count")+
  NULL

The boxplots and histograms above show that the distributions of both variables (noShows & total_Time) are strongly positively skewed, which means there are significant outliers in the upper tail of both distributions. The boxplots in particular highlight these large positive outliers, which may in fact be erroneous and which give the distributions their skewed shape. The effect of these outliers on our results depends on the clustering method used, but in the case of k-means clustering we should be ‘worried’: outliers can distort the results by shifting the centres of the clusters.
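A log transform (used later when we build the training data) is the standard remedy for this kind of skew; a small sketch on made-up numbers, not the real viewing data, shows the effect:

```r
# Toy illustration: a log transform pulls in the long right tail.
x <- c(rep(5, 50), rep(20, 30), rep(100, 15), rep(2000, 5))
mean(x) > median(x)           # TRUE: right skew drags the mean above the median
max(x) / median(x)            # the largest value is 160x the median
max(log(x)) / median(log(x))  # after logging, only about 3.3x
```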

6.1 Delete infrequent users

Delete the records for users whose total view time is less than 5 minutes or who viewed fewer than 5 programs. These users are not very likely to be informative for clustering purposes. Alternatively, we can view these users as a “low-engagement” cluster.

userdata_red<-userdatall%>%filter(total_Time>=5)%>%filter(noShows>=5)
ggplot(userdata_red, aes(x=total_Time)) +geom_histogram(binwidth=25)+labs(x="Total Time Watched (mins)", y= "Count")

glimpse(userdata_red)
## Rows: 3,625
## Columns: 22
## $ user_id       <chr> "002b2e", "0059d9", "00aad3", "00c6e6", "00caa7", "00e31…
## $ noShows       <int> 31, 20, 8, 16, 16, 5, 6, 7, 5, 7, 36, 5, 8, 5, 11, 6, 8,…
## $ total_Time    <dbl> 534.355573, 446.054906, 38.004417, 148.464367, 167.23811…
## $ weekday       <dbl> 0.8709677, 0.7500000, 1.0000000, 0.8125000, 0.2500000, 1…
## $ weekend       <dbl> 0.12903226, 0.25000000, 0.00000000, 0.18750000, 0.750000…
## $ Afternoon     <dbl> 0.00000000, 0.15000000, 0.00000000, 0.00000000, 0.375000…
## $ Day           <dbl> 0.09677419, 0.45000000, 0.00000000, 0.00000000, 0.125000…
## $ Evening       <dbl> 0.8064516, 0.4000000, 0.7500000, 1.0000000, 0.5000000, 0…
## $ Night         <dbl> 0.09677419, 0.00000000, 0.25000000, 0.00000000, 0.000000…
## $ Children      <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0…
## $ Comedy        <dbl> 0.06451613, 0.00000000, 0.25000000, 0.00000000, 0.000000…
## $ Drama         <dbl> 0.3225806, 0.0000000, 0.0000000, 0.0625000, 0.8125000, 0…
## $ Entertainment <dbl> 0.06451613, 0.10000000, 0.00000000, 0.37500000, 0.000000…
## $ Factual       <dbl> 0.41935484, 0.30000000, 0.25000000, 0.50000000, 0.187500…
## $ Learning      <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.000000…
## $ Music         <dbl> 0.06451613, 0.00000000, 0.00000000, 0.00000000, 0.000000…
## $ News          <dbl> 0.03225806, 0.50000000, 0.37500000, 0.00000000, 0.000000…
## $ NoGenre       <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0…
## $ RelEthics     <dbl> 0.00000000, 0.00000000, 0.00000000, 0.06250000, 0.000000…
## $ Sport         <dbl> 0.03225806, 0.05000000, 0.12500000, 0.00000000, 0.000000…
## $ Weather       <dbl> 0.00000000, 0.05000000, 0.00000000, 0.00000000, 0.000000…
## $ avg_view_pct  <dbl> 0.29720926, 0.45242522, 0.11615515, 0.23324406, 0.170699…

7 Clustering with K-Means

Now we are ready to find clusters in the BBC iPlayer viewers. We will start with the K-Means algorithm.

7.1 Training a K-Means Model

Train a K-Means model. Start with 2 clusters and make sure you de-select user_id variable. Also don’t forget to scale the data. Use 50 random starts. Should we use more starts?

  • Using a higher nstart increases the number of initial configurations of the k centroids (in this case 2 centroids) that the algorithm tries. The split of the data into clusters is sensitive to the initial centroid configuration, so trying several configurations and picking the best one reduces this sensitivity. After increasing nstart, the sizes of the 2 clusters did not change, so we conclude that 50 starts are enough and a higher number (which is computationally slightly more expensive) is not needed.
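The logic of this check can be sketched on a built-in dataset (iris, not the iPlayer data): the best objective found over many random starts should be no larger than the objective from a single start, and once extra starts stop improving it, the solution has stabilised.

```r
# Compare the k-means objective (total within-cluster sum of squares)
# from one random start vs the best of 50, on scaled iris data.
set.seed(42)
dat <- scale(iris[, 1:4])
wss_1  <- kmeans(dat, centers = 3, nstart = 1)$tot.withinss
wss_50 <- kmeans(dat, centers = 3, nstart = 50)$tot.withinss
wss_50 <= wss_1  # if the two are equal, extra starts are no longer helping
```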

Also display the cluster sizes. See the RMarkdown file from the last session to identify the R functions you need for this and the tasks below.

Use summary() on the kmeans object to examine the components of the results of the clustering algorithm. How many points are in each cluster?

userdata_train <- userdatall %>% 
  # Get rid of variables that you might not need. Do not include no shows as well because it is highly correlated with total time
  select(-user_id, -noShows, -weekday) %>% 
  #log transform total time to reduce the impact of outliers 
  mutate(total_Time = log(total_Time)) 
  
#scale the data
userdata_train2<-data.frame(scale(userdata_train))


#train kmeans clustering (eclust comes from the factoextra package)
library(factoextra)
model_kmeans_2clusters<-eclust(userdata_train2, "kmeans", k = 2,nstart = 50, graph = FALSE)


#add clusters to the data frame
userdata_withClusters<-mutate(userdata_train2, cluster = as.factor(model_kmeans_2clusters$cluster))
#Check the components of this object.
summary(model_kmeans_2clusters)
##              Length Class      Mode   
## cluster      9351   -none-     numeric
## centers        38   -none-     numeric
## totss           1   -none-     numeric
## withinss        2   -none-     numeric
## tot.withinss    1   -none-     numeric
## betweenss       1   -none-     numeric
## size            2   -none-     numeric
## iter            1   -none-     numeric
## ifault          1   -none-     numeric
## silinfo         3   -none-     list   
## nbclust         1   -none-     numeric
## data           19   data.frame list
#Size of the clusters

model_kmeans_2clusters$size
## [1] 5704 3647
  • There are 3647 points in one cluster and 5704 points in the other.

7.2 Visualizing the results

7.2.1 Cluster centers

Plot the normalized cluster centers. Try to describe the clusters that the algorithm suggests.

#your code here
#Plot centers for k=2

#First generate a new data frame with cluster centers and cluster numbers
cluster_centers <- data.frame(cluster=as.factor(c(1:2)),model_kmeans_2clusters$centers)

#transpose this data frame
cluster_centers_t <- cluster_centers %>% gather(variable,value,-cluster,factor_key = TRUE)

#plot the centers
graphkmeans_2clusters <- ggplot(cluster_centers_t, aes(x = variable, y = value))+  
  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+     
  geom_point(size=1,shape=4)+ 
  geom_hline(yintercept=0)+
  theme_minimal() +
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),
        plot.title = element_text(size = 13, face = "bold"))+
  labs(title = "K-means Centers k=2",
       x = "Variable", y = "Mean value") +
  guides(color=guide_legend("Cluster"))
  
graphkmeans_2clusters

Can you interpret each cluster from this plot? Did you arrive at meaningful clusters?

  • The first cluster (blue) contains viewers who have a shorter total watching time and prefer to watch during the day. The second cluster (pink) contains viewers who have a longer total watching time and prefer to watch in the evening. The clusters also differ noticeably in their preferred genres: the day-watchers (who may include more children) favour “Factual”, “Learning”, “Music”, “News” and “Sport” content, while the other cluster prefers “Comedy” and “Drama” and is also more likely to watch a show to the end. From these two clusters we can roughly conclude that the most active users watch iPlayer in the evening and favour entertainment content.

How can you use the cluster information to improve the viewer experience with BBC iPlayer? We will come back to these points below. However it is important to think about these issues at the beginning.

  • The BBC can use the cluster information to build a recommendation system that advertises series likely to appeal to the dominant cluster at different times of day and days of the week. For example, based on which cluster a customer is in (or is likely to be in), the BBC can personalise the homepage using the characteristics of that cluster. For cluster 1, in which viewers like to watch drama in the evenings, the BBC can push notifications later in the day, when these viewers are likely to be active, and recommend content liked by their “neighbours” in the cluster.

7.2.2 Clusters vs variables

Plot a scatter plot for the viewers with respect to total_Time and weekend variables with color set to the cluster number of the user. What do you observe? Which variable seems to play a more prominent role in determining the clusters of users?

#your code here

ggplot(userdata_withClusters, 
       aes(x = total_Time, y = weekend, color =  as.factor(cluster))) +
  geom_jitter()+
  labs(color = "Cluster") +
  labs(title = "The two clusters have a prominent difference in total time of watching",
       subtitle = "Scatter plot of whether the viewers' have watched on weekends and their total watching time ",
       x = "Total time", y = "Weekend 
       (+ve means more time spent watching on weekends)") +
  theme_minimal() +
  theme(text = element_text(size=10),
        plot.title = element_text(size = 13, face = "bold")) 

# Note that geom_jitter adds a small noise to each observation so that we can see overlapping points
  • Looking across the horizontal axis (total time), the blue dots sit mostly on the right while the red dots are distributed further to the left. Looking across the vertical axis (weekend), there is not much difference between the two distributions. This indicates that, compared to total time, weekend is a less prominent factor in differentiating the 2 clusters.

7.2.3 Clusters vs PCA components

Repeat the previous step using the first two principal components, via the fviz_cluster function.

#your code here
fviz_cluster(model_kmeans_2clusters, userdata_train2, palette = "Set2", ggtheme = theme_minimal()) +
  labs(title = "Principal component analysis of the 2-cluster model with log transformation") +
  theme(text = element_text(size = 10),
        plot.title = element_text(size = 12, face = "bold", vjust = 2)) 

7.2.4 Clusters vs PCA components without log transform

As a “side exercise”, use the K-means method again, but this time do not log transform total time and include noShows as well. Compare your results to the case when you use the log transformation. Then visualize the first two principal components using the fviz_cluster function.

#your code here
userdata_train3 <- userdatall %>% 
  # Get rid of variables that you might not need
  select(-user_id, -weekday)
  
#scale the data
userdata_train4 <- data.frame(scale(userdata_train3))



#train kmeans clustering
model2_kmeans_2clusters_no_log <- eclust(userdata_train4, "kmeans", k = 2,nstart = 50, graph = FALSE)


#add clusters to the data frame
userdata_withClusters2 <- mutate(userdata_train4, cluster = as.factor(model2_kmeans_2clusters_no_log$cluster))

#plot the first two principal components
fviz_cluster(model2_kmeans_2clusters_no_log, userdata_train4, palette = "Set2", ggtheme = theme_minimal()) +
  labs(title = "Principal component analysis of the 2-cluster model without log transformation") +
  theme(text = element_text(size=10),
        plot.title = element_text(size = 12, face = "bold", vjust = 2)) 

paste0("Information of the 2-cluster model with log transformation")
## [1] "Information of the 2-cluster model with log transformation"
summary(model_kmeans_2clusters)
##              Length Class      Mode   
## cluster      9351   -none-     numeric
## centers        38   -none-     numeric
## totss           1   -none-     numeric
## withinss        2   -none-     numeric
## tot.withinss    1   -none-     numeric
## betweenss       1   -none-     numeric
## size            2   -none-     numeric
## iter            1   -none-     numeric
## ifault          1   -none-     numeric
## silinfo         3   -none-     list   
## nbclust         1   -none-     numeric
## data           19   data.frame list
paste0("Size of clusters:")
## [1] "Size of clusters:"
model_kmeans_2clusters$size
## [1] 5704 3647
paste0("Information of the 2-cluster model without log transformation")
## [1] "Information of the 2-cluster model without log transformation"
summary(model2_kmeans_2clusters_no_log)
##              Length Class      Mode   
## cluster      9351   -none-     numeric
## centers        40   -none-     numeric
## totss           1   -none-     numeric
## withinss        2   -none-     numeric
## tot.withinss    1   -none-     numeric
## betweenss       1   -none-     numeric
## size            2   -none-     numeric
## iter            1   -none-     numeric
## ifault          1   -none-     numeric
## silinfo         3   -none-     list   
## nbclust         1   -none-     numeric
## data           20   data.frame list
paste0("Size of clusters:")
## [1] "Size of clusters:"
model2_kmeans_2clusters_no_log$size
## [1] 3811 5540

Do you observe any outliers?

  • Yes.

  • The two models split the data into two clusters of broadly consistent sizes, so the log transformation does not materially change the k-means cluster assignment. However, the PCA diagrams are different. With the log transformation, the difference between the two clusters along Dim1 is enlarged and the spread within each cluster is reduced, giving a well-separated PCA diagram in which the two clusters can be clearly distinguished. Without it, Dim1 (likely driven by total time) has a much larger span and cluster 2 takes up most of that range, squeezing cluster 1 to the far left. While the majority of points in cluster 2 are concentrated in the top-left corner, the points in the middle and on the bottom right appear as outliers. Not using a log transformation therefore leaves the clusters with less explanatory power, since not all points in cluster 2 are closely related in the PCA view.
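One way to quantify this comparison (a sketch assuming both fitted models above are still in the workspace) is the share of total variance the clustering captures between clusters:

```r
# Between-cluster sum of squares as a share of the total sum of squares:
# a higher share means the two clusters are more separated.
# (assumes model_kmeans_2clusters and model2_kmeans_2clusters_no_log from above)
model_kmeans_2clusters$betweenss / model_kmeans_2clusters$totss
model2_kmeans_2clusters_no_log$betweenss / model2_kmeans_2clusters_no_log$totss
```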

7.3 Elbow Chart

Produce an elbow chart and identify a reasonable range for the number of clusters.

#your code here
#Here is a short way of producing the elbow chart using "fviz_nbclust" function. 
fviz_nbclust(userdata_train2,kmeans, method = "wss")+
  labs(subtitle = "Elbow method")

7.4 Silhouette method

Repeat the previous step for Silhouette analysis.

#your code here
fviz_nbclust(userdata_train2, kmeans, method = "silhouette",k.max = 15)+
  labs(subtitle = "Silhouette method")

Question 5: Summarize the conclusions of your Elbow Chart and Silhouette analysis. What range of values for the number of clusters seems more plausible?

  • In the elbow plot, the SSE does not stabilise within k = 10, so we look for values in the middle of the range that give a relatively low SSE without overfitting the data. In the Silhouette analysis, we look for values with a high silhouette width while still keeping the risk of overfitting in mind. Overall, the most plausible values for k are 4 and 6.
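For reference, the quantity fviz_nbclust plots for each candidate k is the average silhouette width, which can be computed directly with the cluster package; a standalone sketch on the built-in iris data rather than the iPlayer data:

```r
# Average silhouette width for one candidate k, computed by hand.
library(cluster)  # for silhouette()
set.seed(42)
dat <- scale(iris[, 1:4])
km  <- kmeans(dat, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(dat))
mean(sil[, "sil_width"])  # the value fviz_nbclust would plot at k = 3
```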

7.5 Comparing k-means with different k

Question 6: For simplicity let’s focus on lower values. Now find the clusters using kmeans for k=3, 4 and 5. Plot the centers and check the number of observations in each cluster. Based on these graphs which one seems to be more plausible? Which clusters are observable in each case? Don’t forget to check the cluster sizes.

#your code here
userdata_train <- userdatall %>% 
  # Get rid of variables that you might not need. Do not include no shows as well because it is highly correlated with total time
  select(-user_id, -noShows, -weekday) %>% 
  #log transform total time to reduce the impact of outliers 
  mutate(total_Time = log(total_Time)) 
  
#scale the data
userdata_train2<-data.frame(scale(userdata_train))


#Fit kmeans models with 3 clusters
#train kmeans clustering
model_kmeans_3clusters<-eclust(userdata_train2, "kmeans", k = 3,nstart = 50, graph = FALSE)
#add clusters to the data frame
userdata_with3Clusters<-mutate(userdata_train2, cluster = as.factor(model_kmeans_3clusters$cluster))



#Fit kmeans models with 4 clusters
#train kmeans clustering
model_kmeans_4clusters<-eclust(userdata_train2, "kmeans", k = 4,nstart = 50, graph = FALSE)
#add clusters to the data frame
userdata_withClusters<-mutate(userdata_train2, cluster = as.factor(model_kmeans_4clusters$cluster))


#Fit kmeans models with 5 clusters
#train kmeans clustering
model_kmeans_5clusters<-eclust(userdata_train2, "kmeans", k = 5,nstart = 50, graph = FALSE)
#add clusters to the data frame
userdata_withClusters<-mutate(userdata_train2, cluster = as.factor(model_kmeans_5clusters$cluster))
#check the size of clusters for each model
paste0("Size of cluster at k = 3:")
## [1] "Size of cluster at k = 3:"
model_kmeans_3clusters$size
## [1] 1034 3124 5193
paste0("Size of cluster at k = 4:")
## [1] "Size of cluster at k = 4:"
model_kmeans_4clusters$size
## [1] 4482 2351 1222 1296
paste0("Size of cluster at k = 5:")
## [1] "Size of cluster at k = 5:"
model_kmeans_5clusters$size
## [1]  395 2646 1045 1149 4116
  • The differences between the cluster sizes decrease as we increase the value of k. When k is 3, the clusters have more “unique” characteristics than when k is 5. Having more clusters is like splitting the large clusters into smaller ones; the data points within the large clusters will therefore not be as distinguishable as points across clusters.
#PCA visualizations

# plots to compare
#I use the fviz_cluster function, which is part of the `factoextra` library
p1 <- fviz_cluster(model_kmeans_3clusters, geom = "point", data = userdata_train2) + 
  ggtitle("Principal component analysis at k = 3") +
  theme_minimal() +
  theme(text = element_text(size=10),
        plot.title = element_text(size = 12, face = "bold", vjust = 2)) 

p2 <- fviz_cluster(model_kmeans_4clusters, geom = "point", data = userdata_train2) + 
  ggtitle("Principal component analysis at k = 4") +
  theme_minimal() +
  theme(text = element_text(size=10),
        plot.title = element_text(size = 12, face = "bold", vjust = 2))

p3 <- fviz_cluster(model_kmeans_5clusters, geom = "point", data = userdata_train2) + 
  ggtitle("Principal component analysis at k = 5") +
  theme_minimal() +
  theme(text = element_text(size=10),
        plot.title = element_text(size = 12, face = "bold", vjust = 2))

library(gridExtra)
grid.arrange(p1, p2,p3, ncol = 1) 

  • In the PCA plots above, we can see that as we increase the number of clusters, there is more and more overlap between them. When k = 3, the range of each cluster is so large that some points are not strongly related to the others within the same cluster. When k = 5, clusters 1, 2, 4 and 5 largely overlap (and thus can be very similar), and it also becomes difficult to interpret the characteristics of each cluster. The middle plot (k = 4) seems the most reasonable of the three: although some overlap is still present, the clusters are relatively more “independent” and their sizes are more “reasonable”.

  • Notice also that some clusters are consistently present in each model. For example, cluster 3 in the top plot is very similar to cluster 1 in the middle plot and cluster 2 in the bottom plot, and cluster 2 in the top plot is similar to cluster 2 in the middle plot and cluster 4 in the bottom plot.
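The impression that clusters persist across k can be checked directly by cross-tabulating the assignments (a sketch assuming the fitted models from the chunk above):

```r
# If a k=3 cluster survives into the k=4 solution, its users should land
# almost entirely in a single k=4 cluster, i.e. one dominant count per row.
table(k3 = model_kmeans_3clusters$cluster,
      k4 = model_kmeans_4clusters$cluster)
```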

#Plot centers
#your code here

#Plot centers for k=3
#First generate a new data frame with cluster centers and cluster numbers
cluster_centers <- data.frame(cluster=as.factor(c(1:3)),model_kmeans_3clusters$centers)
#transpose this data frame
cluster_centers_t <- cluster_centers %>% gather(variable,value,-cluster,factor_key = TRUE)
#plot the centers
graphkmeans_3clusters <- ggplot(cluster_centers_t, aes(x = variable, y = value))+  
  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+ 
  geom_point(size=1,shape=4)+
  geom_hline(yintercept=0)+
  theme_minimal() +
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),
        plot.title = element_text(size = 13, face = "bold"))+
  labs(title = "K-means Centers k=3",
       x = "Variable", y = "Mean value") +
  guides(color=guide_legend("Cluster"))

  
  

#Plot centers for k=4
#First generate a new data frame with cluster centers and cluster numbers
cluster_centers <- data.frame(cluster=as.factor(c(1:4)),model_kmeans_4clusters$centers)
#transpose this data frame
cluster_centers_t <- cluster_centers %>% gather(variable,value,-cluster,factor_key = TRUE)
#plot the centers
graphkmeans_4clusters <- ggplot(cluster_centers_t, aes(x = variable, y = value))+  
  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+ 
  geom_point(size=1,shape=4)+
  geom_hline(yintercept=0)+
  theme_minimal() +
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),
        plot.title = element_text(size = 13, face = "bold"))+
  labs(title = "K-means Centers k=4",
       x = "Variable", y = "Mean value") +
  guides(color=guide_legend("Cluster"))

#Plot centers for k=5
#First generate a new data frame with cluster centers and cluster numbers
cluster_centers <- data.frame(cluster=as.factor(c(1:5)),model_kmeans_5clusters$centers)
#transpose this data frame
cluster_centers_t <- cluster_centers %>% gather(variable,value,-cluster,factor_key = TRUE)
#plot the centers
graphkmeans_5clusters <- ggplot(cluster_centers_t, aes(x = variable, y = value))+  
  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+ 
  geom_point(size=1,shape=4) +
  geom_hline(yintercept=0) +
  theme_minimal() +
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),
        plot.title = element_text(size = 13, face = "bold"))+
  labs(title = "K-means Centers k=5",
       x = "Variable", y = "Mean value") +
  guides(color=guide_legend("Cluster"))

grid.arrange(graphkmeans_3clusters, graphkmeans_4clusters,graphkmeans_5clusters, ncol = 1)

  • From the plots above we can again see clusters recurring as we increase the value of k: cluster 2 at k = 3 is similar to cluster 2 at k = 4 and cluster 1 at k = 5, and cluster 3 at k = 3 is similar to cluster 1 at k = 4 and cluster 2 at k = 5.

8 Comparing results of different clustering algorithms

8.1 PAM

Fit a PAM model for the k value you chose above for k-means. Determine how many points each cluster has. Plot the centers of the clusters and produce PCA visualization.

#your code here
#PAM clustering with `k` as the number of clusters
k=4
k4_pam <-eclust(userdata_train2, "pam", k = k, graph = FALSE)
#Inspect the cluster medoids
k4_pam$medoids
##      total_Time     weekend  Afternoon         Day   Evening       Night
## [1,]  0.7003162  0.05452185 -0.1178371 -0.12431497 0.2395850 -0.04878912
## [2,]  0.9756600  0.29086402 -0.3188525 -0.08086839 0.1559089  0.22112161
## [3,]  1.0547551 -0.41650976 -0.1509715 -0.37496830 0.4326837  0.05502270
## [4,] -0.8600069  0.37680663 -0.5485844  0.52738371 0.2395850 -0.49864034
##        Children     Comedy      Drama Entertainment    Factual    Learning
## [1,] -0.2239897 -0.3807217 -0.3933940    -0.3947571  1.0861689 -0.08799631
## [2,] -0.2239897  0.2836991 -0.7318539     2.5797746 -0.3306575 -0.08799631
## [3,] -0.2239897  0.3859177  1.1426930    -0.3947571 -0.2715461 -0.08799631
## [4,] -0.2239897 -0.3807217 -0.7318539    -0.3947571 -0.7148816 -0.08799631
##          Music       News    NoGenre   RelEthics      Sport    Weather
## [1,] -0.188278  0.6660690 -0.1184595 -0.07779578 -0.4239726 -0.1803931
## [2,] -0.188278 -0.4148573 -0.1184595 -0.07779578 -0.1963741 -0.1803931
## [3,] -0.188278 -0.4148573 -0.1184595 -0.07779578 -0.4239726 -0.1803931
## [4,] -0.188278 -0.4148573 -0.1184595 -0.07779578  2.9900050 -0.1803931
##      avg_view_pct
## [1,]  -0.13324254
## [2,]  -0.02449019
## [3,]   0.12242029
## [4,]  -0.92374051
#First we generate a new data frame with cluster medoids and cluster numbers
cluster_medoids<-data.frame(cluster=as.factor(c(1:k)),k4_pam$medoids)

#then transpose this data frame
cluster_medoids_t<-cluster_medoids %>% gather(variable,value,-cluster,factor_key = TRUE)

#plot medoids
graph_pam_medoids <- ggplot(cluster_medoids_t, aes(x = variable, y = value))+
  geom_line(aes(color = cluster, group = cluster), linetype = "dashed", size = 1)+
  geom_point(size = 1, shape = 4)+
  geom_hline(yintercept = 0)+
  theme(text = element_text(size = 10),
        axis.text.x = element_text(angle = 45, hjust = 1))+
  ggtitle(paste("PAM Medoids k =", k))

graph_pam_medoids

#We can also visualize the centers (instead of the medoids) for a fairer comparison with k-means
userdata_withClusters<-mutate(userdata_train2, 
                                   cluster = as.factor(k4_pam$cluster))

center_locations <- userdata_withClusters %>% group_by(cluster) %>% summarize_at(vars(total_Time:avg_view_pct), mean)

#Next we use gather to collect information together
xa2p<- gather(center_locations, key = "variable", value = "value",-cluster,factor_key = TRUE)

#then we use ggplot to visualize centers
pamcenters <- ggplot(xa2p, aes(x = variable, y = value))+
  geom_line(aes(color = cluster, group = cluster), linetype = "dashed", size = 1)+
  geom_point(size = 2, shape = 4)+
  geom_hline(yintercept = 0)+
  theme(text = element_text(size = 10),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title = element_text(size = 5),
        legend.text = element_text(size = 5))+
  labs(title = paste("PAM Centers k =", k), color = "Cluster")+
  scale_colour_manual(values = c("darkgreen", "orange", "red", "blue"))


pamcenters

#PCA visualization of the PAM clusters, alongside the k-means solution for comparison
fviz_cluster(k4_pam) + ggtitle("PAM k = 4")+scale_colour_manual(values = c("darkgreen", "orange", "red","blue"))
fviz_cluster(model_kmeans_4clusters, data = userdata_train2) + ggtitle("K-means k = 4")+scale_colour_manual(values = c("darkgreen", "orange", "red","blue"))

8.2 H-Clustering

Use hierarchical clustering with the same k you chose above. Set hc_method equal to "average" and then to "ward.D". What differences do you observe between the results of these two methods? Visualize the results using dendrograms. How many points does each cluster have? Plot the centers of the clusters and produce a PCA visualization.

#your code here
#First we find the distances between points. Then we determine how to form the clusters
res.dist <- dist(userdata_train2, method = "euclidean")

res.hc <-  hcut(res.dist, hc_method = "ward.D",k=4)
summary(res.hc)
##             Length   Class  Mode     
## merge          18700 -none- numeric  
## height          9350 -none- numeric  
## order           9351 -none- numeric  
## labels             0 -none- NULL     
## method             1 -none- character
## call               3 -none- call     
## dist.method        1 -none- character
## cluster         9351 -none- numeric  
## nbclust            1 -none- numeric  
## silinfo            3 -none- list     
## size               4 -none- numeric  
## data        43715925 dist   numeric
fviz_silhouette(res.hc)
##   cluster size ave.sil.width
## 1       1 6313          0.05
## 2       2 1094          0.27
## 3       3 1347          0.19
## 4       4  597         -0.13

plot(res.hc,hang = -1, cex = 0.5)
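The assignment also asks to contrast hc_method = "average" with "ward.D"; only ward.D is fit above. As a sketch of the comparison, here is a self-contained base-R version on synthetic two-cluster data (the toy data is an assumption standing in for `userdata_train2`; on the real data you would reuse `res.dist`):

```r
# Toy comparison of average vs ward.D linkage (synthetic stand-in data;
# on the real data you would reuse res.dist from above).
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
d <- dist(toy, method = "euclidean")

hc_avg  <- hclust(d, method = "average")  # merges by mean inter-cluster distance
hc_ward <- hclust(d, method = "ward.D")   # merges to minimise within-cluster variance

# cut both trees at k = 2 and cross-tabulate the resulting memberships
table(average = cutree(hc_avg, k = 2), ward.D = cutree(hc_ward, k = 2))
```

Average linkage tends to peel off outliers into small clusters on noisy data, while Ward's criterion favours compact, more balanced clusters, which is one reason the two methods can disagree on the iPlayer data.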

Plot the centers of H-clusters and compare the results with K-Means and PAM.

#your code here

#First let's find the averages of the variables by cluster
userdata_withClusters<-mutate(userdata_train2, 
                                   cluster = as.factor(res.hc$cluster))

center_locations <- userdata_withClusters %>% group_by(cluster) %>% summarize_at(vars(total_Time:avg_view_pct), mean)

#Next I use gather to collect information together
xa2<- gather(center_locations, key = "variable", value = "value",-cluster,factor_key = TRUE)

#Next I use ggplot to visualize centers
hclust_center <- ggplot(xa2, aes(x = variable, y = value))+
  geom_line(aes(color = cluster, group = cluster), linetype = "dashed", size = 1)+
  geom_point(size = 2, shape = 4)+
  geom_hline(yintercept = 0)+
  theme(text = element_text(size = 10),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title = element_text(size = 5),
        legend.text = element_text(size = 5))+
  labs(title = "H-clust Centers k=4", color = "Cluster")+
  scale_colour_manual(values = c("darkgreen", "orange", "red", "blue"))
## Compare it with KMeans
hclust_center

Question 7: Based on the results of these three methods, what can you conclude?

While the k-means algorithm identifies 4 distinct clusters, these are less clearly visible in the results of the hierarchical (H-Clust) algorithm. The partitioning around medoids (PAM) algorithm also confirms 4 distinct clusters. As we concluded from the silhouette and elbow-method charts, we are best served with k = 4, segmenting the BBC iPlayer viewership into four distinct viewer groups that can be targeted with separate content and adverts.
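One way to back this conclusion numerically is to compare average silhouette widths across the three methods. Below is a self-contained sketch on synthetic data using the cluster package (bundled with R); with the real data you would pass the fitted cluster vectors and `dist(userdata_train2)` instead:

```r
library(cluster)  # recommended package shipped with R: pam(), silhouette()

# synthetic two-cluster data standing in for the iPlayer user data
set.seed(1)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2))
d <- dist(toy)

km <- kmeans(toy, centers = 2, nstart = 25)
pm <- pam(toy, k = 2)
hc <- cutree(hclust(d, method = "ward.D"), k = 2)

# average silhouette width per method: higher = better-separated clusters
avg_sil <- function(cl) mean(silhouette(cl, d)[, "sil_width"])
sapply(list(kmeans = km$cluster, pam = pm$clustering, hclust = hc), avg_sil)
```

On the iPlayer data, a clearly higher average width for one method (at the same k) would be additional evidence for preferring that partition.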

9 Subsample check

At this stage you must have chosen the number of clusters. We will try to reinforce your conclusions and verify that they are not due to chance by dividing the data into two equal parts. Use K-means clustering, fixing the number of clusters to your choice, on these two data sets separately. If you get similar-looking clusters, you can rest assured that your conclusions are robust. If not, you might want to reconsider your decision.

library(rsample)
#the following code chunk splits the data into two equal halves
set.seed(2345)
train_test_split <- initial_split(userdata_train2, prop = 0.5)
testing <- testing(train_test_split) #50% of the data is set aside for testing
training <- training(train_test_split) #50% of the data is set aside for training

#Fit k-means to each dataset and compare your results

model_kmeans_testing<-eclust(testing, "kmeans", k = 4,nstart = 50, graph = FALSE)

model_kmeans_training<-eclust(training, "kmeans", k = 4,nstart = 50, graph = FALSE)

cluster_centers_test <- data.frame(cluster=as.factor(c(1:4)),model_kmeans_testing$centers)

#transpose this data frame
cluster_centers_testing <- cluster_centers_test %>% gather(variable,value,-cluster,factor_key = TRUE)

#plot the centers
graphkmeans_testing <- ggplot(cluster_centers_testing, aes(x = variable, y = value))+  
  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+     
  geom_point(size=1,shape=4)+ 
  geom_hline(yintercept=0)+
  theme_minimal() +
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),
        plot.title = element_text(size = 13, face = "bold"))+
  labs(title = "K-means Centers k=4 (test half)",
       x = "Variable", y = "Mean value") +
  guides(color=guide_legend("Cluster"))
  
graphkmeans_testing

cluster_centers_train <- data.frame(cluster=as.factor(c(1:4)),model_kmeans_training$centers)

#transpose this data frame
cluster_centers_training <- cluster_centers_train %>% gather(variable,value,-cluster,factor_key = TRUE)

#plot the centers
graphkmeans_training <- ggplot(cluster_centers_training, aes(x = variable, y = value))+  
  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+     
  geom_point(size=1,shape=4)+ 
  geom_hline(yintercept=0)+
  theme_minimal() +
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),
        plot.title = element_text(size = 13, face = "bold"))+
  labs(title = "K-means Centers k=4 (training half)",
       x = "Variable", y = "Mean value") +
  guides(color=guide_legend("Cluster"))
  
graphkmeans_training

Question 8: Based on the results, what can you conclude? Are you more or less confident in your results?

Given that the two randomly drawn halves of the data produce similar clusters, we can be confident in the results obtained in our analysis above.
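This visual check can be supplemented with a numeric one: match each centre from one half to its nearest centre in the other half and inspect the gaps. Here is a self-contained base-R sketch on synthetic data (with the real data, substitute `model_kmeans_testing$centers` and `model_kmeans_training$centers`):

```r
# synthetic two-cluster data standing in for the scaled user data
set.seed(2345)
xs <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
            matrix(rnorm(100, mean = 5), ncol = 2))
xs <- xs[sample(nrow(xs)), ]           # shuffle rows before splitting
half <- seq_len(nrow(xs) / 2)

km_a <- kmeans(xs[half, ],  centers = 2, nstart = 50)
km_b <- kmeans(xs[-half, ], centers = 2, nstart = 50)

# distance from each centre in half A to its nearest centre in half B;
# small gaps indicate the halves recover essentially the same clusters
center_gap <- apply(km_a$centers, 1, function(ca)
  min(sqrt(colSums((t(km_b$centers) - ca)^2))))
center_gap
```

If any gap is large relative to the spread of the data, the corresponding cluster is not stable across subsamples and the choice of k should be revisited.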

10 Conclusions

Question 9: In plain English, explain which clusters you can confidently conclude that exist in the data, based on all your analysis in this exercise.

Do you think you chose the right k? Explain your reasoning.

What assumptions do you think your results are sensitive to? How can you check the robustness of your conclusions? Just explain, you don’t have to carry out the analysis.

Finally, explain how the information about these clusters can be used to improve the viewer experience for the BBC or other online video streaming services.